Background / Context:


Instrumental variables (IV) methods allow for consistent estimation of causal effects, but suffer 
from poor finite-sample properties and data availability constraints. Bound, Jaeger, and Baker 
(1995) establish that estimation with weak instruments, even under a weak relationship between 
the instrument and the error term, can lead to large inconsistencies and finite sample bias. IV 
estimates also tend to have relatively large standard errors, often inhibiting the interpretability of 
differences between IV and non-IV point estimates. Lastly, instrumental variables’ idiosyncratic 
nature reduces their availability in data sets alongside outcome and other variables of interest. 

Most prior work on two-sample IV has exploited multiple data sets with the goal of attaining 
identification. Following the independent discovery of a similar method by Klevmarken (1982), 
Angrist and Krueger (1992) propose a two-sample instrumental variables estimator that allows 
for estimation of each covariance matrix composing the IV estimator with separate data sets from 
the same population, where both data sets have an observed instrument in common, but not the 
dependent and endogenous variables. Arellano and Meghir (1992) correspondingly propose a 
method equivalent to two-sample two-stage least squares (TS2SLS) to identify a model of labor 
supply. Under the assumption that the different samples utilized are drawn from the same 
population, these estimators identify parameters of interest consistently. Given its computational 
convenience and favorable asymptotic properties (Inoue and Solon, 2005), the TS2SLS estimator 
is a natural choice for instrumental variables estimation under data combination. 

While data combination in this context can be seen as a second-best solution — reserved for when 
identification cannot be secured through a single sample — it has the potential to provide 
additional useful information to applied researchers in any scenario. Even when a parameter of 
interest is identified and consistently estimated with a single sample, data combination can be 
preferable for the mean squared error (MSE) of IV estimates of that parameter. For example, 
estimating the first-stage relationship between the endogenous regressor and the instrument(s) 
with a larger sample than would be possible with the primary data set alone can improve MSE. 
Because finite sample bias of the TS2SLS estimator depends on sampling error in first stage 
estimation, this leads to a reduction in bias and potentially an increase in efficiency of the IV 
estimate. Moreover, the use of an auxiliary data set can provide additional covariates affecting 
the outcome of interest, which can be used to increase precision. Incidentally, these additional 
covariates can also be used in evaluating the exogeneity of an instrument. Lastly, quality of 
measurement may differ between primary and auxiliary data sets, and a data set with better 
measures of the instrument may also be preferable to use to estimate the first stage. 

Purpose / Objective / Research Question / Focus of Study: 

This paper aims to explore the properties and potential applications of data combination, 
specifically through the lens of the TS2SLS estimator. The paper, in its final form, will 
demonstrate the finite sample properties of the TS2SLS estimator and provide guidelines to 
empirical researchers to identify when using auxiliary data through the TS2SLS estimator results 
in preferable estimates. This will be done analytically in a basic framework where feasible, but 
more general propositions will be argued through simulation evidence. 
Significance / Novelty of study:

While econometric literature has addressed data combination problems thoroughly through both 
Generalized Method of Moments (GMM) and non-parametric frameworks, little has been done 
to outline the practical scenarios under which data combination results in preferable estimates in 
the context of instrumental variables. This paper aims to outline the potential gains of and 
provide guidelines for successful implementation of the TS2SLS estimator. Specifically, the 
potential finite sample bias reduction and efficiency gains in a variety of situations can help 
researchers better estimate causal effects and test the robustness of their estimates. The data 
combination cases outlined in the next section show some examples of how data combination 
methodology can improve valid causal inference. 

Statistical, Measurement, or Econometric Model: 

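Consider a simple linear system with one endogenous variable and a well-behaved error term:

\[
\begin{aligned}
y_1 &= y_2 \beta_1 + x \beta_2 + \varepsilon_1 \\
y_2 &= z \gamma + \varepsilon_2 \\
\mathrm{Cov}(\varepsilon_1, \varepsilon_2) &\neq 0 \\
E(\varepsilon_1 \mid z) &= 0 \\
E(\varepsilon_1^2 \mid z) &= \sigma^2
\end{aligned}
\]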

x and z are vectors of exogenous variables. A linear regression of $y_1$ on $y_2$ and x produces a
biased and inconsistent estimate of $\beta_1$. A two-stage least squares procedure, in which fitted
values are generated from a first stage regression of $y_2$ on z and x and then used as the
instrument for $y_2$ in an IV regression of $y_1$ on $y_2$ and x, produces a consistent but biased
estimate of $\beta_1$.
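The contrast between OLS and 2SLS in this system can be illustrated with a minimal simulation sketch. The parameter values, the omission of x, and the helper names (`simulate`, `ols_slope`, `tsls_slope`) are assumptions made for illustration and are not taken from the simulations reported in the appendix. In this particular design the OLS probability limit is 2 while the true $\beta_1$ is 1.

```python
import numpy as np

def simulate(n, rng, beta1=1.0, gamma=1.0):
    """Draw one sample from an illustrative version of the linear system."""
    z = rng.normal(size=n)                   # instrument
    eps2 = rng.normal(size=n)                # first-stage error
    eps1 = 2.0 * eps2 + rng.normal(size=n)   # outcome error, correlated with eps2
    y2 = gamma * z + eps2                    # endogenous regressor
    y1 = beta1 * y2 + eps1                   # outcome (no exogenous x, for brevity)
    return y1, y2, z

def ols_slope(y, x):
    """No-intercept OLS slope of y on a single regressor x."""
    return float(x @ y / (x @ x))

def tsls_slope(y1, y2, z):
    """2SLS with one instrument: regress y1 on the first-stage fitted values."""
    y2_hat = ols_slope(y2, z) * z
    return ols_slope(y1, y2_hat)

rng = np.random.default_rng(0)
y1, y2, z = simulate(100_000, rng)
b_ols = ols_slope(y1, y2)        # inconsistent: absorbs Cov(y2, eps1)
b_2sls = tsls_slope(y1, y2, z)   # consistent for beta1
print(b_ols, b_2sls)
```

With a large sample the OLS slope settles near 2 and the 2SLS slope near 1; the finite-sample bias of 2SLS only becomes visible at much smaller sample sizes or with weaker instruments.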

Assume there are two data sets containing relevant covariates for this model, with sample sizes
$N_1$ and $N_2$, respectively. The data sets are random samples from the same population. Table 1
outlines some different stylized cases of data availability. For example, in case 1, there are $N_1$
observations containing covariate values $\{y_1, y_2, z\}$ in data set 1, and $N_2$ observations
containing covariate values $\{y_2, z\}$ in data set 2. In every case (with the exception of case 0),
$\beta_1$ is identified and can be consistently estimated with data set 1 alone. The next section will
provide some hypotheses regarding the econometric properties of combined estimation with both data
sets as compared to using data set 1 alone, and then provide some preliminary simulation evidence for
those claims.

Usefulness / Applicability of Method: 

Define $\hat{\beta}_{TS2SLS}$ as the estimator for $\beta_1$ corresponding to the data combination
procedure that will be suggested for each case. $\hat{\beta}_{SS2SLS}$ is the 2SLS estimator using data
set 1 alone. $\hat{\beta}_{FS2SLS}$ is the estimator corresponding to using 2SLS in the “full-sample”
hypothetical case in which all covariates contained in data sets 1 and 2 are observed in a single data
set, with a sample size equal to the greater of $N_1$ and $N_2$. Each case has an individualized
procedure, outlined in the appendix, for efficiently using information from both data sets. Case 0
pertains to the classic motivation for data combination (e.g., Angrist & Krueger, 1992), where
additional data sources are necessary to estimate the relationship between the endogenous regressor
and the instrument, and is only presented for contrast.

Case 1 identifies a scenario in which consistent estimates can be calculated with data set 1 alone,
but combined estimation with data set 2 would result in smaller bias and more efficient estimates
due to data set 2’s larger sample size. Simulation results find a significant difference (t = 4.7) in
bias, and the standard errors of $\hat{\beta}_{TS2SLS}$ were one-tenth the size of the standard errors
of $\hat{\beta}_{SS2SLS}$ on average.

Case 2 is similar to Case 1, except that data set 1 also has exogenous regressors $x_1$ affecting
$y_1$ (that may in general be correlated with z), which must be appropriately “partialled out” of the
instrument via application of the Frisch-Waugh-Lovell theorem. Similar methods are followed
for the remaining cases. The simulation results provide evidence for the prediction about the bias
made earlier: the bias of $\hat{\beta}_{TS2SLS}$ is bounded above by Bias[$\hat{\beta}_{SS2SLS}$] and
below by Bias[$\hat{\beta}_{FS2SLS}$], with significant differences from those bounds (t = 33.4 and
t = 36.2, respectively). The simulation is inconclusive for efficiency predictions, either because of
lack of power in the simulation, or because of the specific calibration.

In Case 3, if variation in z is random and in turn $\mathrm{Cov}(z, x_2) = 0$, then using additional
covariates in data set 2 in estimation of the first stage should result in more precise estimates of
$\gamma$ if $\mathrm{Cov}(y_2, x_2) \neq 0$. Fitted values for data set 1 can then be generated as
$\hat{y}_2 = z\hat{\gamma}$, and resulting estimates of $\beta_1$ should have lower MSE than if data set
1 alone were used, even though data set 1 does not have information on $x_2$. Simulation results show
that the two-sample estimator outperforms SS2SLS in bias and efficiency, with significant differences.
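The precision argument for the first stage can be sketched numerically. The design below is loosely modeled on the Case 3 first stage, with $x_2$ drawn independently of z; the sample size, number of replications, and function name `first_stage_gammas` are assumptions chosen for illustration only.

```python
import numpy as np

def first_stage_gammas(n, rng):
    """Return gamma-hat from y2 on z alone and from y2 on (z, x2)."""
    z = rng.normal(size=n)
    x2 = rng.normal(size=n)              # independent of z, so Cov(z, x2) = 0
    eps2 = rng.normal(size=n)
    y2 = z + 10.0 * x2 + 2.0 * eps2      # Case 3-style first stage, gamma = 1
    g_short = float(z @ y2 / (z @ z))    # first stage that omits x2
    X = np.column_stack([z, x2])
    g_long = float(np.linalg.lstsq(X, y2, rcond=None)[0][0])  # controls for x2
    return g_short, g_long

rng = np.random.default_rng(1)
draws = np.array([first_stage_gammas(500, rng) for _ in range(2000)])
sd_short, sd_long = draws.std(axis=0)    # sampling spread of each gamma-hat
print(sd_short, sd_long)
```

Both versions are centered on the true $\gamma$ because $x_2$ is uncorrelated with z, but the version that controls for $x_2$ removes its variation from the residual and so has a much smaller sampling spread.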

In Case 4, data set 1 has a larger set of instruments, but a smaller sample size, than data set 2.
Information from data set 2 can still be used to improve first-stage estimates in a manner similar
to case 2. Even if the instruments are uncorrelated, this can be done controlling for the sample
correlation between the instruments caused by sampling error, or simply implicitly imposing that
they are uncorrelated in generating the fitted values. The simulation presented uses the
partial-out method (in contrast, Case 3 imposed $\mathrm{Cov}(z, x_2) = 0$, possibly explaining its
efficiency gain). The bias results mirror Case 2’s, confirming the bias reduction of the TS2SLS
estimator. However, TS2SLS is less efficient overall than SS2SLS.

Case 5 appears similar to case 2, but the role of the second data set is different: it has fewer
observations, but a richer set of covariates. In this case, data set 2 becomes the “second-stage”
data set. TS2SLS may be preferred for efficiency purposes (thanks to the inclusion of $x_1$) even if
$\beta_1$ is identified and consistently estimated with data set 1 alone. Alternatively, if
$E(\varepsilon_1 \mid z) \neq 0$ but $E(\varepsilon_1 \mid z, x_1) = 0$, estimates with data set 1 alone
are inconsistent, but estimates using TS2SLS are consistent. In this scenario, the inclusion of data
set 2 allows for an instrument which is only exogenous conditional on some vector of covariates $x_1$
to be validly used while still maximally exploiting the larger sample size of data set 1. Simulation
results for this case are currently inconclusive.

Conclusions: 
Provided two random samples from the same population, there are multiple scenarios of
covariate availability in two data sets that have preferable estimates using the TS2SLS-type 
procedures presented here. The primary hypothesized improvements from data combination in 
the model presented are finite sample bias reduction and efficiency gain under certain conditions. 
The efficiency gain is only achieved if the reduction in overall sampling error achieved by the 
data combination exceeds the new sampling error induced by the “partialling out” procedure 
sometimes required to combine information on the basis of common covariates (or, in other 
words, the use of imputed regressors in the first stage). For example, Case 1, which did not have 
additional covariates beyond the endogenous regressor and the instrument, has an unequivocal 
efficiency gain, but the remaining cases are susceptible to lower overall efficiency. These results 
match theoretical expectations, but are only confirmable once the finite sample properties are 
analytically derived, and lacking that, once simulations are run with a wide range of parameter 
choices. 

The chief limitations on these findings relate to the comparability of available samples. The 
common covariates used to link the data sets must measure the same thing, and may be measured 
with different levels of error. More critically, samples are rarely from the same population (and 
rarely truly random), and so optimal methods of sample reconstruction must be considered. 
Drawing on sample selection and quasi-experimental methods literature, different methods 
should be considered for augmenting auxiliary data in order to ensure the comparability of 
auxiliary and primary data (as measured by equivalence of sample moments). These include 
inverse probability weighting (Wooldridge, 2002), inverse probability tilting (Graham, Pinto, and 
Egel, 2008), imposing moment restrictions by weighting (Hellerstein and Imbens, 1999), and 
“entropy balancing” (Hainmueller, 2011). 

There are several practical situations emulating the stylized cases in Table 1 that are amenable to 
TS2SLS, with the main practical constraint being data availability. The most likely candidate 
causal questions for this approach are those which are answered with instrumental variables that 
are universally available (e.g., birth date), are easily matchable from outside sources (e.g., IVs 
related to time and/or geography), or that relate to some independently known selection rule (i.e., 
regression discontinuity designs). Educational policy evaluations with a quasi-experimental 
approach are likely to benefit, as data sets often will have information on students’ geographic 
regions, school districts, and even schools, which are easily matchable to policy variation 
affecting variables of causal interest. 

Another broader example of an application lies in instrumental variables approaches to 
estimating the average return to schooling. Angrist and Krueger (1991) use compulsory school 
attendance laws and variation in age at school entry due to school start age policies to estimate a 
relationship between years of schooling and age at school entry, and subsequently use this 
exogenous variation in years of schooling to estimate the wage return to schooling. They use 
2SLS to estimate a linear model with some controls using the 1970 Census, uniformly finding no 
significant difference between 2SLS and OLS estimates of the return to schooling. The 
combination of this Census data with a covariate-rich data set using the methods outlined here 
(i.e., Case 5), such as the Panel Study of Income Dynamics, can allow for more efficient 
estimates as more of the non-educational determinants of wages can be controlled for, even 
though they are not required for consistent estimation. 
Appendix B. Tables and Figures

Table 1. Data Combination Situations

Case   Data set 1                  Data set 2                  Conditions
0      $\{y_1, z\}$                $\{y_2, z\}$                (none)
1      $\{y_1, y_2, z\}$           $\{y_2, z\}$                $N_1 < N_2$
2      $\{y_1, y_2, x_1, z\}$      $\{y_2, z\}$                $N_1 < N_2$
3      $\{y_1, y_2, x_1, z\}$      $\{y_2, x_2, z\}$           $\mathrm{Cov}(z, x_2) = 0$
4      $\{y_1, y_2, z_1, z_2\}$    $\{y_2, z_1\}$              $N_1 < N_2$
5      $\{y_1, y_2, z\}$           $\{y_1, y_2, x_1, z\}$      $N_1 > N_2$, $\mathrm{Cov}(z, x_1) = 0$


Case 1 Simulation: Data Generating Process

Data Set 1 Sample Size (N1)    200
Data Set 2 Sample Size (N2)    4,800
DGP                            $z \sim N(0,1)$
                               $\varepsilon_2 \sim N(0,1)$
                               $u_1 \sim N(0,4)$
                               $u_2 \sim N(0,1)$
                               $\varepsilon_1 = 2\varepsilon_2 + u_1$
                               $x_1 = 2z + u_2$
                               $y_2 = z + 4\varepsilon_2$
                               $y_1 = y_2 + \varepsilon_1$




Case 1 Simulation Results (# Simulations = 200,000)

Estimator                                                  Mean        Standard Deviation
Full-sample First Stage F-Statistic                        301.0804    35.7702
Small-sample First Stage Coefficient                       0.999422    0.285607
Full-sample First Stage Coefficient                        1.000047    0.057789
Full-sample OLS                                            1.470599    0.007218
Small-sample 2SLS ($\hat{\beta}_{SS2SLS}$)                 0.944031    5.584372
Hypothetical Full-sample 2SLS ($\hat{\beta}_{FS2SLS}$)     0.998256    0.041193
Two-sample 2SLS ($\hat{\beta}_{TS2SLS}$)                   1.002295    0.458146
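As a check on the Case 1 design, the probability limit of OLS can be worked out by hand. The calculation reads the data generating process as $y_2 = z + 4\varepsilon_2$ and $y_1 = y_2 + \varepsilon_1$ with $\varepsilon_1 = 2\varepsilon_2 + u_1$, all shocks mutually independent, and treats the second argument of $N(\cdot,\cdot)$ as a variance; these readings are assumptions about the listed DGP.

\[
\operatorname{plim}\hat{\beta}_{OLS}
= 1 + \frac{\mathrm{Cov}(y_2, \varepsilon_1)}{\mathrm{Var}(y_2)}
= 1 + \frac{4 \cdot 2}{1 + 16}
= 1 + \frac{8}{17}
\approx 1.4706
\]

This agrees with the reported full-sample OLS mean of 1.470599.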
Case 2 Simulation: Data Generating Process

Data Set 1 Sample Size (N1)    3,000
Data Set 2 Sample Size (N2)    3,000
DGP                            $z \sim N(0,1)$
                               $\varepsilon_2 \sim N(0,1)$
                               $u_1 \sim N(0,4)$
                               $u_2 \sim N(0,1)$
                               $\varepsilon_1 = 2\varepsilon_2 + u_1$
                               $x_1 = 2z + u_2$
                               $y_2 = z + 2\varepsilon_2$
                               $y_1 = y_2 + 10x_1 + \varepsilon_1$




Case 2 Simulation Results (# Simulations = 100,000)

Estimator                                                  Mean        Standard Deviation
Full-sample First Stage F-Statistic                        2401.083    109.5684
Small-sample First Stage Coefficient                       1.00034     0.224158
Full-sample First Stage Coefficient                        0.999974    0.020418
Full-sample OLS                                            1.952388    0.010184
Small-sample 2SLS ($\hat{\beta}_{SS2SLS}$)                 0.938835    0.402948
Hypothetical Full-sample 2SLS ($\hat{\beta}_{FS2SLS}$)     0.997849    0.064915
Two-sample 2SLS ($\hat{\beta}_{TS2SLS}$)                   0.953917    0.409761


Case 3 Simulation: Data Generating Process

Data Set 1 Sample Size (N1)    3,000
Data Set 2 Sample Size (N2)    3,000
DGP                            $z \sim N(0,1)$
                               $\varepsilon_2 \sim N(0,1)$
                               $u_1 \sim N(0,4)$
                               $u_2 \sim N(0,1)$
                               $\varepsilon_1 = 2\varepsilon_2 + u_1$
                               $x_1 = 2z + u_2$
                               $x_2 \sim N(0,1)$
                               $y_2 = z + 10x_2 + 2\varepsilon_2$
                               $y_1 = y_2 + x_1 + \varepsilon_1$




SREE Spring 2012 Conference Abstract Template 








Case 3 Simulation Results (# Simulations = 100,000)

Estimator                                                  Mean        Standard Deviation
Full-sample First Stage F-Statistic                        37887.59    1414.267
Small-sample First Stage Coefficient                       0.99928     0.416349
Full-sample First Stage Coefficient                        0.999887    0.036519
Full-sample OLS                                            1.038353    0.005004
Small-sample 2SLS ($\hat{\beta}_{SS2SLS}$)                 0.88087     17.59361
Hypothetical Full-sample 2SLS ($\hat{\beta}_{FS2SLS}$)     0.990412    4.778995
Two-sample 2SLS ($\hat{\beta}_{TS2SLS}$)                   1.04026     9.535611


Case 4 Simulation: Data Generating Process

Data Set 1 Sample Size (N1)    400
Data Set 2 Sample Size (N2)    9,600
DGP                            $z_1 \sim N(0,1)$
                               $z_2 \sim N(0,1)$
                               $\varepsilon_2 \sim N(0,1)$
                               $u_1 \sim N(0,4)$
                               $u_2 \sim N(0,1)$
                               $\varepsilon_1 = 4\varepsilon_2 + u_1$
                               $x_1 \sim N(0,4)$
                               $y_2 = z_1 + z_2 + 16\varepsilon_2$
                               $y_1 = y_2 + \varepsilon_1$




Case 4 Simulation Results (# Simulations = 100,000)

Estimator                                                  Mean        Standard Deviation
Full-sample First Stage F-Statistic                        38.3442     12.32802
Small-sample First Stage Coefficient                       1.004556    0.803622
Full-sample First Stage Coefficient                        0.999825    0.163579
Full-sample OLS                                            1.248064    0.001294
2SLS on Data Set 1 Alone                                   1.052799    0.35121
2SLS on Data Set 2 Alone                                   0.999892    0.033188
Hypothetical Full-sample 2SLS ($\hat{\beta}_{FS2SLS}$)     1.0384      0.628719
Two-sample 2SLS ($\hat{\beta}_{TS2SLS}$)                   38.3442     12.32802
Case 5a Simulation: Data Generating Process

Data Set 1 Sample Size (N1)    400
Data Set 2 Sample Size (N2)    9,600
DGP                            $z \sim N(0,1)$
                               $\varepsilon_2 \sim N(0,1)$
                               $u_1 \sim N(0,4)$
                               $u_2 \sim N(0,1)$
                               $\varepsilon_1 = 2\varepsilon_2 + u_1$
                               $x_1 \sim N(0,4)$
                               $y_2 = z + 6\varepsilon_2$
                               $y_1 = y_2 + 6x_1 + \varepsilon_1$




Case 5a Simulation Results (# Simulations = 200,000)

Estimator                                                  Mean        Standard Deviation
Full-sample First Stage F-Statistic                        267.656     33.14328
Small-sample First Stage Coefficient                       1.000627    0.301397
Full-sample First Stage Coefficient                        0.999933    0.06124
Full-sample OLS                                            1.324331    0.003404
2SLS on Data Set 1 Alone                                   0.998517    0.126608
2SLS on Data Set 2 Alone                                   0.947522    2.713303
Hypothetical Full-sample 2SLS ($\hat{\beta}_{FS2SLS}$)     0.998693    0.029129
Two-sample 2SLS ($\hat{\beta}_{TS2SLS}$)                   1.004571    0.420684
Procedures


Case 1

1. Regress $y_2$ on z and get coefficient estimate $\hat{\gamma}$ in data set 2.

2. Generate fitted values $\hat{y}_2 = \hat{\gamma} z$ in data set 1.

3. Use $\hat{y}_2$ as instruments for $y_2$ in a regression of $y_1$ on $y_2$.
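The Case 1 steps can be sketched in code as follows. This is one reading of the procedure, not a definitive implementation: step 3 is taken to be the usual second-stage regression of $y_1$ on the fitted values, since with a single instrument a literal IV regression inside data set 1 would make $\hat{\gamma}$ cancel and reproduce the single-sample estimate. The design follows the Case 1 DGP as read here (with $N(0,4)$ treated as variance 4); the function names and the reduced number of replications are assumptions.

```python
import numpy as np

def draw(n, rng):
    """Draw n observations from a Case 1-style design (true beta_1 = 1)."""
    z = rng.normal(size=n)
    eps2 = rng.normal(size=n)
    u1 = rng.normal(scale=2.0, size=n)   # standard deviation 2, i.e. variance 4
    eps1 = 2.0 * eps2 + u1
    y2 = z + 4.0 * eps2
    y1 = y2 + eps1
    return y1, y2, z

def slope(y, x):
    """No-intercept OLS slope of y on a single regressor x."""
    return float(x @ y / (x @ x))

def ts2sls_case1(y1_s1, z_s1, y2_s2, z_s2):
    gamma_hat = slope(y2_s2, z_s2)       # step 1: first stage in data set 2
    y2_hat = gamma_hat * z_s1            # step 2: fitted values in data set 1
    return slope(y1_s1, y2_hat)          # step 3: second stage in data set 1

rng = np.random.default_rng(2)
estimates = []
for _ in range(2000):
    y1_a, _, z_a = draw(200, rng)        # data set 1 (its y2 is not used here)
    _, y2_b, z_b = draw(4800, rng)       # data set 2 (its y1 is not used here)
    estimates.append(ts2sls_case1(y1_a, z_a, y2_b, z_b))
estimates = np.array(estimates)
print(estimates.mean(), estimates.std())
```

Because the first-stage coefficient comes from the larger sample, its sampling error contributes little, and the spread of the estimates is driven mainly by the reduced-form regression in data set 1.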

Case 2

Here, the procedure is

1. Regress $y_2$ on z and get coefficient estimate $\hat{\gamma}$ in data set 2.

Then, in data set 1,

2. Generate residuals $e_{y2} = y_2 - \hat{\gamma} z$

3. Regress $x_1$ on z and get residuals, $e_z$

4. Regress $e_{y2}$ on $e_z$ and calculate fitted values $\hat{e}_{y2}$

5. Generate $\hat{y}_2 = \hat{\gamma} z + \hat{e}_{y2}$

6. Use $\hat{y}_2$ as an instrument for $y_2$ in a regression of $y_1$ on $y_2$ and $x_1$

In a single sample, the $\hat{y}_2$ generated by this procedure is computationally identical to the
fitted values from OLS of $y_2$ on z and $x_1$.
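The single-sample equivalence can be checked numerically. The sketch below assumes that step 5 adds the step-4 fitted values to $\hat{\gamma} z$, uses regressions without intercepts, and draws illustrative data; the coefficients, sample size, and variable names are hypothetical and not taken from the reported simulations.

```python
import numpy as np

rng = np.random.default_rng(3)
n = 500
z = rng.normal(size=n)
x1 = 2.0 * z + rng.normal(size=n)        # exogenous covariate correlated with z
y2 = z + 0.5 * x1 + rng.normal(size=n)   # illustrative first-stage relationship

def slope(y, x):
    """No-intercept OLS slope of y on a single regressor x."""
    return float(x @ y / (x @ x))

gamma_hat = slope(y2, z)                 # step 1 (here in the same sample)
e_y2 = y2 - gamma_hat * z                # step 2: residual of y2 after z
e_z = x1 - slope(x1, z) * z              # step 3: x1 with z partialled out
e_y2_fit = slope(e_y2, e_z) * e_z        # step 4: fitted values of e_y2 on e_z
y2_hat = gamma_hat * z + e_y2_fit        # step 5: combine the two pieces

# Direct OLS of y2 on (z, x1) for comparison
X = np.column_stack([z, x1])
y2_ols_fit = X @ np.linalg.lstsq(X, y2, rcond=None)[0]
print(np.allclose(y2_hat, y2_ols_fit))
```

The match follows from the Frisch-Waugh-Lovell logic: z and $e_z$ are orthogonal and span the same space as z and $x_1$, so projecting on each in turn reproduces the joint projection. In the two-sample setting the only change is that $\hat{\gamma}$ in step 1 comes from data set 2.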

Case 3

1. Regress $y_2$ on z and $x_2$, and get coefficient estimate $\hat{\gamma}$, in data set 2.

2. Generate fitted values $\hat{y}_2 = \hat{\gamma} z$ in data set 1.

3. Use $\hat{y}_2$ as instruments for $y_2$ in a regression of $y_1$ on $y_2$.

Case 4

1. Regress $y_2$ on $z_1$ and get coefficient estimate $\hat{\gamma}_1$ in data set 2.

Then, in data set 1,

2. Generate residuals $e_{y2} = y_2 - \hat{\gamma}_1 z_1$

3. Regress $z_2$ on $z_1$ and get residuals, $e_z$

4. Regress $e_{y2}$ on $e_z$ and calculate fitted values $\hat{e}_{y2}$

5. Generate $\hat{y}_2 = \hat{\gamma}_1 z_1 + \hat{e}_{y2}$

6. Use $\hat{y}_2$ as an instrument for $y_2$ in a regression of $y_1$ on $y_2$ and $x_1$

Case 5

1. Regress $y_2$ on z and get coefficient estimate $\hat{\gamma}$ in data set 1.

Then, in data set 2,

2. Generate residuals $e_{y2} = y_2 - \hat{\gamma} z$

3. Regress $x_1$ on z and get residuals, $e_z$

4. Regress $e_{y2}$ on $e_z$ and calculate fitted values $\hat{e}_{y2}$

5. Generate $\hat{y}_2 = \hat{\gamma} z + \hat{e}_{y2}$

6. Use $\hat{y}_2$ as an instrument for $y_2$ in a regression of $y_1$ on $y_2$ and $x_1$